
Before first start of FL

In this document we highlight how to proceed with the setup before running the .ipynb FL script for the first time.

The only things needed for this tutorial are an Azure subscription and a resource group. Most of the time, you will need to request these from your Azure admin.

Adjustments to clients

These steps are necessary on every single client that will be used for federated learning.

Add Persistent Volume Claim to cluster

In order to train on our data, we utilize the Kubernetes objects Persistent Volume (PV) and Persistent Volume Claim (PVC). These provide isolation between the host machine and the cluster: here you specify what data should be shared into the Kubernetes cluster. If you only want to train on data located in the cloud, you can skip this setup.

We have a pre-configured PV and PVC in the k8s/ihd folder. The PV:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: fl-ich-data
  labels:
    type: local
spec:
  storageClassName: manual
  claimRef: #used to bind the PV to the PVC
    namespace: default #ADJUST - namespace of the PVC below
    name: fl-data-claim #ADJUST - name of the PVC below
  capacity:
    storage: 10Gi #this is the max size that will be matched, no matter the capacity in the PVC
  accessModes:
    - ReadOnlyMany #change this if you need to write to the local filesystem
  hostPath:
    path: "/home/fldata" #ADJUST - local path to mount to the PV, doesn't have to match the one in the PVC

and the PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fl-data-claim
  namespace: default
  labels:
    ml.azure.com/pvc: "true"
  annotations:
    ml.azure.com/mountpath: "/home/fldata" #ADJUST - path used for mounting to AML, can be different from path in PV, used for mount_location in centralised ARC setting
spec:
  storageClassName: "manual"
  accessModes:
    - ReadOnlyMany #change this if you need to write to local filesystem
  resources:
    requests:
      storage: 10Gi #how much data we assume to need
  volumeName: "fl-ich-data" #ADJUST - name of the PV above

Both need to be deployed through

kubectl apply -f <file_name>

with the PV applied first and the PVC second.

The deployment of PV and PVC can be checked through kubectl get pv -A and kubectl get pvc -A respectively.
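
For reference, a typical deploy-and-verify sequence could look like the following. The file names pv.yaml and pvc.yaml are only placeholders; use the actual file names from the k8s/ihd folder:

# deploy the PV first, then the PVC that binds to it
kubectl apply -f k8s/ihd/pv.yaml
kubectl apply -f k8s/ihd/pvc.yaml

# verify that the PV exists and the PVC reports the Bound status
kubectl get pv -A
kubectl get pvc -A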

Make sure to check the local bind path and the names of the PV and PVC in both files. More information about what to change is present in the respective PV and PVC .yaml files.

The PV and PVC are by default deployed to the default namespace, as are AML jobs submitted through AML Studio. If you decide to change that, please also redeploy the PV and PVC to the other namespace.
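
As an illustration, if your AML jobs run in a hypothetical namespace called azureml-jobs instead of default, the namespace has to be changed in both manifests before redeploying, roughly like this:

# in the PVC (metadata section)
metadata:
  name: fl-data-claim
  namespace: azureml-jobs #hypothetical namespace, adjust to yours

# in the PV (claimRef section)
claimRef:
  namespace: azureml-jobs #must match the PVC namespace above
  name: fl-data-claim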

For more information on the storage types that can be bound as PV and PVC, please see the official Kubernetes documentation.

First time environment setup

Launch AML Studio

This tutorial works with Microsoft Azure Machine Learning Studio. You can launch it after opening the newly created AML Workspace resource.

Fig. 4 - Location of AML Studio link

Clone the repo

Now you need to clone the FL repository into Azure ML Studio. To do so, navigate to "Notebooks". As far as we are aware, AML should automatically create a folder with your name. Navigate there and open a terminal. Note that in order to launch the terminal, your Compute Instance must be running!

Fig. 9 - Open terminal

From here, you will need to head to the official MS documentation page that highlights the AML and Git integration, but in general the git workflow should be the same as always. The repository is currently hosted on code.siemens.com.

Please contact ladislav.pomsar@siemens-healthineers.com or pavol.kincel@siemens-healthineers.com for access to the AIRC FL group. Under this group, clone the AIRC FL Code repository.
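
Once you have access, the steps in the AML terminal are just the usual git commands. The URL below is a placeholder; use the clone URL shown for the repository on code.siemens.com:

git clone <repository-clone-url>
cd <repository-folder>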

Adjustments to code before starting the script

Great job, you have prepared everything needed for the first run! Now it is time to adjust the cloned project to reflect your real environment. Let's start with the conda environment. The information is current as of May 2023.

Adjust client conda environment

Under /dev/submit-jobs-compute-targets/ you should be able to find conda-flare.yaml. This .yaml file defines the conda environment on the NVFlare clients. Change it according to the Python libraries you plan to utilize. The environment is built dynamically during deployment, so there is no need to set anything else in AML Studio.
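
For orientation, a client environment file of this kind typically has the following shape. This is only a sketch with assumed package names and versions (picked to mirror the server Dockerfile further below), not the exact content of conda-flare.yaml in the repository:

name: nvflare-client #assumed environment name
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - pip:
      - nvflare==2.3.0 #pin versions to reduce surprises, see the recommendation below
      - torch==2.0.1
      - torchvision==0.15.2
      - monai[nibabel,ignite,tqdm]==1.1.0
      - scikit-learn
      - tensorboard
      - mlflow
      - azureml-mlflow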

We recommend pinning a specific version for each library; this decreases the chance of unexpected errors. To be sure that such a configuration works, we recommend creating the conda environment on a local PC and running the job locally.
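
A quick local check that the file resolves cleanly could look like this (assuming conda is installed and the environment name matches the one in the .yaml file):

conda env create -f conda-flare.yaml
conda activate nvflare-client #name taken from the sketch above, adjust to the real one
python -c "import nvflare, torch" #smoke test that the key imports work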

Adjust server conda environment

While the client environment is created from the .yaml file present in the repository, the server environment only needs a few libraries (it does not do the computation per se) and is therefore fine with a static configuration. In the code, it is currently referenced as "nvflare-server@latest" in the "Start Flare Server in Container" section of submit-fl-jobs.ipynb.

Such a static environment can be created in the "Environments" section of the left menu of AML Studio.

Fig. 10 - Open terminal

There, create an environment with "Select environment source" set to "Create a new docker context", fill in the section and click Next. In the second step, you will be able to input a Dockerfile. In the MVP, we resorted to the following Dockerfile:


FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04

RUN apt-get update && \
    apt-get install -y python3.8 python3-pip

RUN pip3 install azure-ai-ml nvflare torch torchvision scikit-learn tensorboard ipykernel azure-identity mlflow azureml-mlflow 'monai[nibabel, ignite, tqdm]'
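
If you want to sanity-check the Dockerfile before creating the AML environment, you can build and probe it locally (assuming Docker is installed and the file is saved as Dockerfile in the current directory):

docker build -t nvflare-server:test .
docker run --rm nvflare-server:test python3 -c "import nvflare; import azure.ai.ml" #verify the key packages are installed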

Adjust project definition

In order to do proper provisioning, it is necessary to change ihd-project.yml, found under fl-mvp/dev/nvflare. Currently, the file looks like this:

api_version: 3
name: FL-mvp
description: Training a pneumonia model based on chest x-ray images

participants:
  # input your DNS assigned FQDN
  - name: fl-mvp-shs.westeurope.cloudapp.azure.com
    type: server #client/server
    org: nvidia
    fed_learn_port: 8002 #One of the ports you opened above
    admin_port: 8003 #Second of the ports you opened above
    enable_byoc: true

  - name: gpu-cluster-central #name of the compute, in our case this was a cloud machine
    type: client
    org: nvidia
    enable_byoc: true
    aml_workspace: central-workspace #the workspace the machine belongs to; you can have several machines in several workspaces for a degree of separation (let's say we don't want other data scientists accessing this machine and possibly its data)
    data_source: datastore
    data_path: azureml://subscriptions/08217dea-11b6-4ce7-b44e-b82c6345f2a2/resourcegroups/fl-mvp/workspaces/central-workspace/datastores/workspaceblobstore/paths/pneumonia-sites/gpu-cluster-central #these are data uploaded to the blob storage in Azure

  - name: hospital-buddha
    type: client
    org: nvidia
    enable_byoc: true
    aml_workspace: central-workspace
    data_source: container-folder #PVC in Kubernetes, no need to specify the name

  - name: hospital-shiva
    type: client
    org: nvidia
    enable_byoc: true
    aml_workspace: central-workspace
    data_source: container-folder

  - name: hospital-vishnu
    type: client
    org: nvidia
    enable_byoc: true
    aml_workspace: central-workspace
    data_source: container-folder

  - name: admin@nvidia.com
    type: admin
    org: nvidia
    role: project_admin

# The same methods in all builders are called in the order defined in the builders section
builders:
  - path: nvflare.lighter.impl.workspace.WorkspaceBuilder
    args:
      template_file: master_template.yml
  - path: nvflare.lighter.impl.template.TemplateBuilder
  - path: nvflare.lighter.impl.static_file.StaticFileBuilder
    args:
      # config_folder can be set to inform NVIDIA FLARE where to get configuration
      config_folder: config

      # app_validator is used to verify if the uploaded app has proper structure
      # if not set, no app_validator is included in fed_server.json
      # app_validator: PATH_TO_YOUR_OWN_APP_VALIDATOR

      # when docker_image is set to a docker image name, docker.sh will be generated on server/client/admin
      # docker_image:

      # download_job_url is set to http://download.server.com/ as default in fed_server.json. You can override this
      # with a different url.
      # download_job_url: http://download.server.com/

      overseer_agent:
        path: nvflare.ha.dummy_overseer_agent.DummyOverseerAgent
        # if overseer_exists is true, args here are ignored. The provisioning
        # tool will fill role, name and other local parameters automatically.
        # if overseer_exists is false, args in this section will be used and the sp_end_point
        # must match the server defined above in the format of SERVER_NAME:FL_PORT:ADMIN_PORT
        overseer_exists: false
        args:
          sp_end_point: fl-mvp-shs.westeurope.cloudapp.azure.com:8002:8003
          # make sure to adjust depending on your FQDN and ports
  - path: nvflare.lighter.impl.cert.CertBuilder
  - path: nvflare.lighter.impl.signature.SignatureBuilder

To elaborate a bit more, let's take a look at what we have now. The top section contains the participants. Every participant has a name and a type; the type says whether it is a server, a client or an admin. Based on the type, there are different options for each kind of participant:

  • server - here the unique values are fed_learn_port and admin_port. Those are the ports that were specified and opened during the server setup. The name is the FQDN of the server.

  • client - here, the name is the name of the compute, either a Compute cluster or a Kubernetes cluster from the Compute tab of AML. aml_workspace is the workspace where the cluster is deployed. data_source and data_path tell where the data are stored. Here you can see datastore for Azure storage and container-folder for a k8s cluster. For other data source possibilities, please see the AML documentation.

  • admin - here the only special row is the role. There are several different roles, such as member, lead, org admin and project admin. The capabilities after logging in to the admin console differ based on this role. For more information, please refer to the official NVIDIA documentation.
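
Before heading to the notebook, it can be useful to check that the adjusted project file provisions cleanly. Assuming nvflare is installed in your local environment, the NVFlare provisioning CLI can be run directly against it:

cd fl-mvp/dev/nvflare
nvflare provision -p ihd-project.yml #generates startup kits for all participants into a local workspace folder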

Once we have done the setup, we can head over to submit-fl-jobs.ipynb to test the code. The .ipynb itself provides enough information.